1 Research Accomplishments


Following  is  the  abstract  of  the  research  proposal  that  led  to  funding  of  the  research 
being  reported  here.  It  states  the  research  objectives. 


The aim of the proposed research is to extend experience with a particular
approach to learning in networks of neuron-like adaptive elements. The
long-term goal of this research is the development of massively parallel
adaptive systems that incorporate principles of operation suggested by nervous
systems. It is an alternative to the knowledge-based approach of conventional
Artificial Intelligence. The chief characteristic of our approach is that each
network component is a self-interested agent that attempts to learn, via a
reinforcement learning process, how to obtain its most highly preferred inputs.

These components implement a novel learning rule previously developed by us
that causes them to learn to enter into cooperative interactions with one
another for mutual benefit. A significant consequence of this type of interaction
is that layered networks of these elements can learn to implement nonlinear
associative mappings by constructing the necessary representations. This
method of learning nonlinear associative mappings is therefore one of several
recently developed by various research groups that promises to greatly extend
the power of adaptive networks. We propose to exploit the capabilities of
this method in control tasks related to the following problems: 1) the
development of coordinative structures in motor control, and 2) the learning of
strategies and representations for guiding movement in space. In the context
of these domains, we propose to investigate how well these algorithms scale up
to larger networks, how their performance can be improved, and how their
novel capabilities can extend current abilities to control complex systems.

The  proposed  methodology  relies  on  computer  simulation  and  mathematical 
analysis. 

We have made substantial progress in the following areas: 1) the refinement of
reinforcement learning methods for nonlinear pattern recognition and associative memory, 2)
the development of adaptive network methods applicable to problems in motor control
and robotics, 3) the development of a theoretical perspective on scaling up network
learning algorithms, 4) the investigation of methods for accelerating convergence of learning
methods by adaptively altering learning rate parameters, 5) the development of a modular
network architecture and learning method, and 6) the refinement of reinforcement
learning methods for control of dynamical systems. I discuss each of these areas of progress
below. Several additional topics, which were less well developed than those listed above
when the funding period ended, are also discussed. Most of the research conducted under
this grant has been described in detail in published technical reports, conference papers,
and journal articles. In the sections that follow, the topics on which we have published
are treated with less detail than those on which we have not yet published, and references
to the appropriate published material are provided.

Although we believe that all of this progress has been significant, some of the
research is, in our view, truly outstanding and has already made a considerable impact
on the field of connectionist computing. Specifically, the theoretical results achieved by
J. S. Judd, supported as a graduate student by the grant, on the complexity of network
learning are having considerable influence, as is the research of M. I. Jordan, supported by
the grant as a post-doctoral researcher, on supervised learning for systems with excess
degrees of freedom.


2  Refinement  of  Reinforcement  Learning  Methods 

Substantial progress was made in refining and exploring the utility of neuron-like units
that attempt to maximize an evaluation, or reinforcement, signal. Unlike most of the
neuron-like adaptive units being studied by others, such units do not require direct
specification of target, or desired, outputs. In this section, we focus on reinforcement learning
methods as applied to classification tasks and static decision tasks. In Section 7 we
discuss more recent progress made in understanding reinforcement learning as an approach
to the problem of controlling dynamical systems.

2.1 The Associative Reward-Penalty (A_R-P) Unit

An A_R-P unit is a neuron-like unit having a stochastic binary output and a learning
rule for adjusting its connection weights so as to maximize the probability with which it
receives an evaluation or reinforcement signal indicating “reward” or “success” (Figure
1). It is one embodiment of the idea of a “heterostat” put forward by Klopf [22, 23]:
a neuron-like unit which tries to maximize some local form of “pleasure”. The A_R-P unit
is the most successful of our efforts to develop Klopf’s idea into a concrete
implementation. Barto and Anandan [4] described the learning rule and proved that it maximizes
success probability under certain conditions. In previous research, we found that
layered networks of A_R-P units could reliably learn to solve nonlinear pattern classification
and associative memory tasks [2, 1]. After we developed the A_R-P unit and applied it
to the problem of learning nonlinear mappings by layered networks, the error
back-propagation algorithm was popularized by Rumelhart, Hinton, and Williams [3d|. The
success of this latter algorithm, and the publicity surrounding it, has been partly
responsible for the enormous increase in interest in artificial neural networks. We were
therefore interested in investigating the relationship between the A_R-P and
error back-propagation methods and in comparing their performance. Previous
comparative simulations [1] showed that back-propagation is considerably faster than the
A_R-P method in some simple tasks, but we felt that the A_R-P method might still have
some advantages, especially if we could develop some of the obvious ways of speeding it
up.

Figure 1: An Associative Reward-Penalty (A_R-P) Unit. The unit receives input
patterns and an evaluation or reinforcement signal (NOT a desired response) and
produces a stochastic output.
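
The behavior of a single unit of this kind can be sketched in a few lines. The update below follows the form of the associative reward-penalty rule as it is commonly stated in the literature; it is not reproduced from this report. The names `arp_step` and `reward_fn`, the 0/1 output coding, the logistic output probability, and the parameter values `rho` and `lam` are all assumptions made for illustration.

```python
import math
import random

def arp_step(w, x, reward_fn, rng, rho=0.5, lam=0.05):
    """One trial of a single stochastic binary unit (illustrative sketch).

    w, x      : lists of weights and inputs of equal length
    reward_fn : hypothetical environment, returns 1 (success) or 0 (failure)
    rho, lam  : assumed learning-rate and penalty parameters
    """
    s = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-s))          # probability of emitting action 1
    y = 1 if rng.random() < p else 0        # stochastic binary output
    r = reward_fn(x, y)
    if r == 1:
        # Reward: make the action just taken more likely.
        step = rho * (y - p)
    else:
        # Penalty: move (more slowly) toward the action not taken.
        step = lam * rho * ((1 - y) - p)
    new_w = [wi + step * xi for wi, xi in zip(w, x)]
    return new_w, y, r

# Toy task (assumed): action 1 is always the successful one.
rng = random.Random(0)
w = [0.0]
for _ in range(2000):
    w, y, r = arp_step(w, [1.0], lambda x, y: 1 if y == 1 else 0, rng)
p_final = 1.0 / (1.0 + math.exp(-w[0]))
```

Note that the unit is never told which action was correct; it sees only the success or failure signal that follows the action it happened to take.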

We wanted to take advantage of the theoretical results of Williams [38, 39] showing
that the weight vector of an A_R-P network follows a relevant performance gradient in a
statistical sense; that is, A_R-P networks perform a stochastic form of gradient descent.
Based on these results, we developed a way to apply the A_R-P algorithm to supervised
learning tasks so that the network does something as close as possible to what
back-propagation does, but does it without requiring the complex back-propagation process.
We also introduced a method, which we called “batching,” for increasing the learning
efficiency of A_R-P networks. In this method we essentially let the A_R-P units in the
network generate several actions while network input is held constant at one of the
training patterns. For each of these actions the network receives a reinforcement value.
Weights are then updated according to the resultant weight change computed over this
period of constant input. We applied this method to the task of learning to detect
symmetry axes in binary patterns on a four-by-four grid. In one version of the task,
the network has three output units, and must categorize the input as having either
horizontal, vertical, or diagonal symmetry (only one of the two possible diagonal axes is
used). We also studied a simpler task with a single output unit, in which the network
must discriminate between horizontal and non-horizontal symmetry. For either version of
the task, our networks had sixteen input units, corresponding to the four-by-four grid,
and twelve hidden units. There was full connectivity between layers, yielding a total of
243 modifiable weights and biases in the case with three output units, and 217 modifiable
weights and biases in the case with one output unit.
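
Two points in the preceding paragraph can be checked or sketched directly. The first function below reproduces the stated weight-and-bias counts for fully connected 16-12-3 and 16-12-1 networks. The second is only a rough sketch of the batching idea for a single unit: it assumes the same commonly stated reward-penalty update used for illustration elsewhere, and it assumes that the "resultant weight change" is the sum of the per-action changes. The function names and parameter values are hypothetical.

```python
import math
import random

def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected layered network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def batched_update(w, x, reward_fn, rng, n_actions=10, rho=0.5, lam=0.05):
    """Hold the input x fixed, sample several actions, then update once.

    Assumption: the per-action changes are summed before being applied.
    """
    s = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-s))          # fixed while the input is held
    total = [0.0] * len(w)
    for _ in range(n_actions):
        y = 1 if rng.random() < p else 0
        if reward_fn(x, y) == 1:
            step = rho * (y - p)
        else:
            step = lam * rho * ((1 - y) - p)
        total = [ti + step * xi for ti, xi in zip(total, x)]
    return [wi + ti for wi, ti in zip(w, total)]

print(count_parameters([16, 12, 3]))  # 243
print(count_parameters([16, 12, 1]))  # 217
```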

The results showed that batching can speed up learning by A_R-P networks, but
back-propagation is still faster. Nevertheless, the A_R-P method may have some advantages
over back-propagation: 1) it is much more plausible from a biological perspective, 2) it
may be easier to implement in hardware, and 3) it is applicable to tasks to which
back-propagation is not, namely, tasks in which desired responses are not known for a set of
training instances. Barto and Jordan [6] described these results in detail. This paper
also introduced a version of the A_R-P algorithm, called the “S-model A_R-P”, that works
for real-valued reinforcement signals instead of the binary ones to which the original
A_R-P method was restricted.

Due to the superiority of the error back-propagation method over A_R-P networks as
a practical method for training layered networks in supervised-learning tasks, we shifted
attention to tasks in which the training information required for supervised learning is
absent. In these tasks, called associative reinforcement learning tasks, the target responses
of a network’s output units are not known, but the consequences of the network’s output
patterns on an unknown process can be evaluated by a critic which at each step supplies
the network with an evaluation, or reinforcement, signal. Figure 2 shows how a network
can be applied to such a task. The critic in the figure is shown as supplying signals to
a reinforcement pathway for each unit in the network, but the important point is that
learning occurs even when all these pathways always transmit the same signal to their
target units, i.e., when a single evaluation signal is effectively broadcast to all the units. If
the critic is able to differentiate this global evaluation by sending evaluations specialized
for individual units, learning occurs more quickly. Thus, although A_R-P units can take
advantage of individualized evaluation signals, they are also able to learn as members of
a “team” seeking collective behavior that maximizes reinforcement [2].

In associative reinforcement learning tasks such as shown in Figure 2, the
reinforcement learning abilities of A_R-P units are needed not just for the hidden units but for
the output units as well. One task of this type with which we experimented has been
called a “decentralized team decision problem” (e.g. ref. [11]). We performed some
computational experiments using A_R-P units as the decision makers in a simple example of
a decentralized team decision problem. In this task, there were two decision makers each
making a decision based on different but correlated information about some underlying
uncertainty. The outcome of the decisions depended on the coordinated actions of the
units. The units had to learn how to maximize a payoff through repeated trials. We
found that A_R-P units were consistently able to learn how to solve this task. This is
a learning task in which uncertainty plays an intrinsic role, and to which
supervised-learning methods are not directly applicable. This task is described in a book chapter
by Barto [3], which summarizes a range of results we have achieved involving the
collective behavior of reinforcement-learning units. The chapter also develops the hypothesis that
there may be a close relationship between neuronal learning rules and chemotactic strategies
of freely-living unicellular organisms.

Another aspect of our research on the A_R-P unit has been our continuing attempts to
prove a stronger version of the A_R-P convergence theorem proved by Barto and Anandan
[4]. According to this theorem, an A_R-P unit successfully maximizes success probability
under rather general conditions, but the theorem requires one condition which is quite
stringent. Specifically, although the theorem allows success probabilities to depend on
A_R-P unit actions in the most general probabilistic way (that is, for a given input pattern,
any process whatsoever can determine how success or failure signals depend on unit
activity), the theorem requires that the input patterns to the unit be linearly independent.
This is the same condition ensuring that a unit using the Widrow-Hoff, or Least Mean
Square, learning rule in a supervised task can solve that task exactly. The theorem sheds
no light on what happens when the input patterns are not linearly independent, which is
the usual situation in pattern classification tasks. We have been trying to prove a result
which does not require the assumption of linear independence.

Figure 2: An A_R-P network in an associative reinforcement learning task.
Effective learning occurs even when all the evaluation pathways
transmit identical signals.
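
The role of linear independence in the supervised (Widrow-Hoff) case can be illustrated with a small sketch. The pattern sets, targets, learning rate, and function names below are assumptions chosen for illustration; the code concerns only the Least Mean Square rule, not the A_R-P theorem itself. With three linearly independent input patterns a linear unit can fit arbitrary targets exactly, whereas the four XOR patterns (with a constant bias input) are linearly dependent and no weight vector fits all four targets.

```python
def lms_train(patterns, targets, rate=0.2, epochs=500):
    """Widrow-Hoff (LMS) rule for a single linear unit, cycling over patterns."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            error = t - sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
    return w

def max_error(w, patterns, targets):
    """Largest absolute output error over the training set."""
    return max(abs(t - sum(wi * xi for wi, xi in zip(w, x)))
               for x, t in zip(patterns, targets))

# Linearly independent patterns: an exact solution exists.
independent = [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
independent_targets = [1, 0, 1]

# XOR with a bias input: four patterns in three dimensions, hence dependent.
xor_inputs = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
xor_targets = [0, 1, 1, 0]

w_ind = lms_train(independent, independent_targets)
w_xor = lms_train(xor_inputs, xor_targets)
```

For the XOR set, the signed errors on the four patterns must combine to a fixed nonzero value for any linear unit, so the worst-case error cannot fall below 0.5 no matter how long training continues.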

Unfortunately, removing this assumption requires a proof technique different from
the one Barto and Anandan employed. Consequently, we arranged to consult with
Dr. P. S. Sastry of the Indian Institute of Science, Bangalore, India, an expert in applying
theories of stochastic convergence to learning problems, who was visiting this country.
We worked on several approaches to proving a stronger convergence theorem, and were
able to arrive at results for a special case of the A_R-P algorithm. However, our work
with Sastry indicated that the desired general result for the A_R-P algorithm is not
going to be easy to prove because it seems to involve a very complex stochastic process.
We did not abandon our goal of proving a stronger A_R-P convergence theorem, but we
placed a low priority on it.

2.2 Real-Valued Stochastic Reinforcement Learning


A limitation of the A_R-P unit for applications to motor control problems is that it is a
binary unit, that is, it has just two actions. In control tasks it is usually important to
be able to provide control signals with continuous values. It turns out to be nontrivial
to extend the reinforcement learning principles used by the A_R-P unit to the case of
real-valued outputs. Basically, the difficulty lies in the fact that the probability mass or
density function over the set of unit actions has to be adjusted with experience by
adjusting certain parameters. In the case of just two actions, adjustment over the complete set
of all possible action probabilities is possible by adjusting a single parameter, which in
the case of the A_R-P unit is the unit’s weighted sum. This is because a probability mass
function for two actions has a single degree-of-freedom (i.e., given that the probability of
action 1 is p, then the probability of action 2 must be 1 - p). However, for three or more
actions, more than one parameter is needed.

We developed a reinforcement learning algorithm for a neuron-like unit having
real-valued outputs. We called such a unit an SRV (Stochastic Real-Valued) unit. An SRV
unit’s  output  values  are  generated  by  sampling  from  a  Gaussian  distribution.  The  mean  of 
the  distribution  is  the  weighted  sum  of  inputs  using  one  set  of  weights,  and  the  variance  of 
the  distribution  is  determined  from  a  weighted  sum  of  inputs  using  another  set  of  weights. 
The  learning  rule  adjusts  both  sets  of  weights  as  a  result  of  reinforcement  feedback  so 
that  the  mean  moves  toward  the  optimal  action  for  each  input  pattern  as  the  variance 
decreases.  The  performance  of  SRV  units  was  studied  on  simple  associative  reinforcement 
learning  tasks  (for  example,  AND  and  XOR),  with  good  results.  We  also  applied  SRV 
units  to  a  simulated  motor  control  task  as  described  in  Section  3  below. 
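
How an SRV unit generates its output can be sketched as follows. Only the output computation is shown; the learning rule itself is given in the technical report [10] and is not reproduced here. The function name `srv_output` is hypothetical, and the softplus mapping from the second weighted sum to a positive standard deviation is an assumption, since the text above says only that the variance is determined from a weighted sum of inputs.

```python
import math
import random

def srv_output(w_mean, w_var, x, rng):
    """Sample a real-valued action from a Gaussian (illustrative sketch).

    w_mean : weights whose weighted sum of the inputs gives the mean
    w_var  : second weight set, from which the spread is determined
    """
    mean = sum(wi * xi for wi, xi in zip(w_mean, x))
    s = sum(wi * xi for wi, xi in zip(w_var, x))
    # Assumed mapping: softplus keeps the standard deviation positive.
    std = math.log1p(math.exp(s))
    action = rng.gauss(mean, std)
    return action, mean, std

rng = random.Random(0)
# A strongly negative second weighted sum gives a very narrow distribution,
# so the sampled action is practically equal to the mean.
action, mean, std = srv_output([0.5, -1.0], [-50.0, 0.0], [1.0, 2.0], rng)
```

A wide distribution corresponds to exploration of the action range; as the spread shrinks, the unit's output settles on the mean.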

We also studied the convergence properties of SRV units from a theoretical
perspective. We used martingale theory to analyze the behavior of simplified versions of the
SRV algorithm, and have been able to prove convergence of the weights under certain
conditions. The proof handles the associative aspects of these units in a manner
similar to the A_R-P convergence proof of Barto and Anandan [4]. We are in the process of
preparing this work for submission to a journal. The SRV research was conducted by
V. Gullapalli, a research assistant funded by this grant. A technical report was published
describing the SRV learning algorithm and the simulation results [10]. A version of this
report is in review for the journal Neural Networks.


3  Motor  Control  and  Robotics 

Considerable progress was made on developing connectionist methods for handling
important problems involving the control of systems with many degrees-of-freedom. In this
section we focus on the research specifically involving tasks in motor control and robotics.
However, the methods described are also applicable to other types of tasks.




3.1  Sequence  Learning  for  Systems  with  Excess  Degrees  of  Freedom 

This research was a continuation of the Ph.D. research of M. I. Jordan, who was a
post-doctoral researcher funded by this grant. In his Ph.D. dissertation [14], completed in 1985
under  the  direction  of  David  Rumelhart  and  Donald  Norman  at  UCSD,  Jordan  developed 
networks  capable  of  learning  sequences  of  patterns.  Jordan’s  approach  differs  in  several 
ways  from  previously  studied  network  methods  for  learning  sequences.  First,  his  networks 
incorporate  a  kind  of  central  pattern  generator.  Through  internal  recurrent  connections, 
his  networks  use  internal  state  information  to  generate  temporal  patterns  without  the 
usual  form  of  pattern-to-next-pattern  chaining.  Second,  Jordan  used  a  training  method  in 
which  the  network’s  output  units  are  given  constraints  on  their  actions  instead  of  explicit 
instruction  as  to  exactly  what  those  actions  should  be.  A  consequence  of  this  training 
method  is  that  the  degrees-of-freedom  of  an  output  pattern  that  are  left  unspecified 
by  the  external  trainer  become  determined  by  the  temporal  context  of  the  pattern.  In 
other  words,  the  specific  way  the  required  constraints  are  met  for  an  output  pattern  at  a 
specific  time  is  determined  by  what  the  constraints  were  in  the  past  and  what  they  will 
be  in  the  future.  The  result  is  a  smooth,  efficient  sequence  of  actions  satisfying  the  given 
constraints. 

A natural extension of this approach is to apply it to positioning tasks for multi-jointed
robot manipulators. How can one learn what joint angles produce desired end-effector
positions specified in spatial coordinates (or other coordinates, such as eye-position
coordinates, that are not joint angles)? A critical aspect of this inverse kinematic problem is
how to choose from a usually infinite set of joint angles that yield the same desired
end-effector position. This occurs when the forward kinematic transformation implemented
by the manipulator does not have a unique inverse due to excess degrees-of-freedom in its
structure. Resolving this kind of redundancy is an important problem in robotics, and
several approaches to it have been studied. However, the approach based on Jordan’s
work differs from conventional approaches and represents an important contribution to
the field. Using the network architecture for generating sequences of patterns, it is
possible for a system to learn sequences of positioning tasks in such a way that the problem
of excess degrees-of-freedom is resolved according to the temporal context of each
positioning task in the sequence. How the system learns to select, from an infinite number
of possibilities, a joint configuration that achieves any given target position in space is
determined by the target positions that precede and follow the given target position.
Configurations are selected which tend to minimize the amount of movement that must
be made in moving from position to position.

To accomplish this, Jordan combined his approach to learning sequences with a
method for using the error back-propagation algorithm to learn an inverse kinematic
transformation. The strategy is to learn a forward kinematic transformation (a model
of the robot arm) by a layered network in a “babbling” phase during which random joint
angles are associated with the resulting end-effector spatial positions. Then, to learn how
to reach for specified spatial targets, the spatial error is back-propagated through the
network that forms the arm model to transform it into a vector of joint-angle errors used
for training another layered network. Input to this second network comes from Jordan's
recurrent sequence generating network, and the whole system is instructed to execute a
sequence of reaching tasks. He showed that the network can take advantage of the arm's
redundancy to find sequences of arm configurations in which the solutions at each point
in time depend on the solutions found at nearby points in time, so that the redundancy
is used to allow actions to overlap efficiently. This approach was demonstrated with a
simulated six degree-of-freedom manipulator in a two degree-of-freedom world, as well
as with a simulated manipulator with two fingers. These manipulators were trained to
perform sequences of positioning tasks. These results were described in a technical report
[15] and a book chapter [16].
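
The step of turning a spatial error into joint-angle errors can be sketched with a simplified stand-in. In the sketch below the learned forward model is replaced by the exact kinematics of a planar three-link arm, so that back-propagating the error through the model reduces to multiplying by the transposed Jacobian. The link lengths, the target, the step size, and the function names are assumptions made for illustration; the sketch omits the second network and the sequence generator entirely and simply adjusts the joint angles directly.

```python
import math

LINKS = (1.0, 1.0, 1.0)  # assumed link lengths

def forward(angles):
    """End-effector (x, y) of a planar arm given relative joint angles."""
    x = y = total = 0.0
    for length, theta in zip(LINKS, angles):
        total += theta
        x += length * math.cos(total)
        y += length * math.sin(total)
    return x, y

def joint_errors(angles, spatial_error):
    """Map a spatial error into joint space (transposed Jacobian).

    This is what back-propagation through an exact forward model yields.
    """
    ex, ey = spatial_error
    absolute, total = [], 0.0
    for theta in angles:
        total += theta
        absolute.append(total)
    errors = []
    n = len(angles)
    for i in range(n):
        # Joint i rotates every link from i outward.
        dx = sum(-LINKS[j] * math.sin(absolute[j]) for j in range(i, n))
        dy = sum(LINKS[j] * math.cos(absolute[j]) for j in range(i, n))
        errors.append(dx * ex + dy * ey)
    return errors

def reach(target, angles, rate=0.05, steps=3000):
    """Reduce the spatial error by repeatedly following the joint errors."""
    angles = list(angles)
    for _ in range(steps):
        x, y = forward(angles)
        error = (target[0] - x, target[1] - y)
        angles = [a + rate * g
                  for a, g in zip(angles, joint_errors(angles, error))]
    return angles

target = (1.5, 1.0)
final_a = reach(target, (0.3, 0.3, 0.3))
final_b = reach(target, (-0.3, 1.0, 0.5))
```

Because the arm has three joints and the target has only two coordinates, runs started from different postures will typically settle on different joint configurations that reach the same point, which is the redundancy that the temporal context resolves in Jordan's approach.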

In ref. [15] Jordan develops the theory underlying the approach as a theory of
“generalized supervised learning.” As a result of this theoretical perspective, it is possible to see
how the approach can be applied to a variety of problems involving systems with excess
degrees-of-freedom. These publications have been widely distributed, and the research
they describe is exerting considerable influence on the field. Jordan left the project in
January 1988 to take a position as Assistant Professor in the Department of Brain and
Cognitive Sciences, Massachusetts Institute of Technology, where he is continuing this
line of research.

3.2  Inverse  Kinematics  via  Reinforcement  Learning 

We also investigated the use of reinforcement learning methods for problems in robotics
such as the inverse kinematic problem. The purpose of these simulations was to
investigate the utility of reinforcement learning as an alternative to Jordan’s method, described
above, in tasks involving excess degrees-of-freedom. Whereas the approach pursued by
Jordan requires the learning of a forward model of the robot arm in order to translate
spatial errors into joint errors, the approach using reinforcement learning dispenses with
the necessity of such a model. We applied a network consisting of three SRV
units and 16 hidden “back-prop” units (Figure 3) to the problem of learning a
positioning task with a simulated robot arm with three degrees of freedom (Figure 4). The task
was set up so that there were excess degrees-of-freedom. The planar arm has two joints
and its base can move along the top axis shown in Figure 4. A positioning task was
specified by selecting a target position x_d, which was considered attained if the end of
the arm stopped anywhere on the vertical line at x_d. Clearly this can be accomplished
with an infinite number of different arm configurations. The evaluation signal which was
broadcast to the output SRV units provided a scalar measure of the spatial distance of
the end of the arm from the target line at x_d. The hidden units were trained via
back-propagation of the estimated evaluation gradient. Figure 4 shows one sequence of arm
positions generated by the network shown in Figure 3 after learning. Results suggested
that SRV units can be useful in learning tasks involving excess degrees-of-freedom [10].

Figure 4: The three degree-of-freedom planar arm used to study SRV units
in positioning tasks. A target position is reached when the end
of the arm stops anywhere along the vertical line at x_d.
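
The excess degrees-of-freedom in this task can be made concrete with a small sketch. The geometry below is an assumption for illustration only: unit link lengths, joint angles measured from the axis along which the base slides, and a scalar evaluation equal to the horizontal distance of the arm's end from the target line. The function names are hypothetical, and the actual simulated arm may use different conventions.

```python
import math

def end_x(base, theta1, theta2, l1=1.0, l2=1.0):
    """Horizontal coordinate of the arm's end (assumed geometry)."""
    return base + l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)

def evaluation(base, theta1, theta2, x_d):
    """Scalar distance of the arm's end from the vertical line at x_d."""
    return abs(end_x(base, theta1, theta2) - x_d)

def base_for(theta1, theta2, x_d):
    """Base position that puts the end on the line for any pair of angles."""
    return x_d - end_x(0.0, theta1, theta2)

x_d = 2.0
# Two quite different joint settings, each paired with a suitable base
# position, both place the end of the arm on the target line.
config_a = (base_for(0.2, 0.5, x_d), 0.2, 0.5)
config_b = (base_for(1.0, -0.3, x_d), 1.0, -0.3)
```

Since a suitable base position exists for every pair of joint angles, the set of successful configurations is infinite, and a scalar evaluation of this kind says nothing about which of them to prefer.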

4  Theoretical  Framework  for  Network  Learning 

While supported as a research assistant by the grant, J. S. Judd developed a formal
framework in which to address some important questions about the computational limits
of network learning. He formalized a notion of learning in connectionist networks that
characterizes the training of feed-forward networks. Considering different families of node
functions, i.e., the functions that individual network elements compute, Judd proved that
the learning problem, so formulated, is NP-complete and thus, unless P = NP, has no
efficient general solution. One family of node functions studied is the set of logistic-linear
functions, as used by the back-propagation algorithm. Additional theoretical results describe
special classes of network topologies that can be trained in polynomial time.

Essentially, the learning goal as formulated is to find one algorithm that is guaranteed
to load any performable task in any conceivable feed-forward network, where loading a
task means specifying the functions implemented by all the nodes in the network. It
was proved that this problem has no efficient general solution. However, several ways
were considered to weaken the formulation so as to possibly yield an achievable goal.
There may be large useful classes of networks (defined by some design restrictions) where
loading a task would always be achievable in polynomial time. One can imagine several
ways to constrain the class of networks and/or tasks and/or other aspects in such a way
that the new loading problem would have some special regularity that might facilitate
its solution. A wide range of questions regarding narrowed or altered models of the
connectionist learning goal are discussed in ref. [19]. Answers to some of these questions
will assist connectionist learning research by narrowing its focus to those cases that hold
the promise of scaling up. We believe that there is a great need for theory of this kind
to increase the sophistication of connectionist research.
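
The notion of loading a task can be illustrated with a toy brute-force search. Everything specific in the sketch below is an assumption chosen for illustration and is not Judd's formulation: the fixed 2-2-1 architecture, the small family of threshold node functions, and the function names. The search tries every assignment of node functions until the network reproduces the given input/output pairs. It says nothing about the complexity result itself, but it shows how the number of candidate assignments grows as a power of the number of nodes.

```python
from itertools import product

WEIGHTS = (-1, 0, 1)
BIASES = (-1.5, -0.5, 0.5, 1.5)
# Assumed family of node functions: two-input linear threshold units.
NODE_FUNCTIONS = [(w1, w2, b) for w1 in WEIGHTS for w2 in WEIGHTS for b in BIASES]

def node(params, a, b):
    """Output of one threshold node on inputs a and b."""
    w1, w2, bias = params
    return int(w1 * a + w2 * b + bias > 0)

def network(config, x1, x2):
    """2-2-1 feed-forward network: two hidden nodes feed one output node."""
    h1, h2, out = config
    return node(out, node(h1, x1, x2), node(h2, x1, x2))

def load(task):
    """Return node functions that make the network perform `task`, or None."""
    for config in product(NODE_FUNCTIONS, repeat=3):
        if all(network(config, x1, x2) == t for (x1, x2), t in task.items()):
            return config
    return None

def single_node_loadable(task):
    """Can one node from the family perform the task on its own?"""
    return any(all(node(p, x1, x2) == t for (x1, x2), t in task.items())
               for p in NODE_FUNCTIONS)

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
solution = load(XOR)
```

Here each node has 36 candidate functions, so the three-node network has 36 cubed, or 46,656, candidate assignments; each additional node multiplies that count by 36 again.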

Several publications by Judd describe aspects of this research: refs. [20, 18, 17]. Judd
received the Ph.D. degree in September, 1988, and a revised version of his dissertation
on this subject will be published as a book by The MIT Press. Judd’s work is receiving
considerable attention, not only because it contains some of the few results of this kind
about neural networks, but also because it may have practical significance for designing
networks that are easy to train. Judd was appointed Adjunct Assistant Professor at the
University of Massachusetts, and is now a Visiting Professor at the California Institute
of Technology in Pasadena, where he is continuing this research.


5 Increasing Learning Rate through Learning Rate Adaptation

Despite the negative nature of Judd’s results on the scaling up of network learning, we
pursued several approaches to increasing learning rates of networks. The first approach
is to alter, during the learning process, the parameters that determine how the size and
direction of steps in weight space are computed as functions of the error gradient or
gradient estimate. The use of “momentum” in the error back-propagation
method is an example of this approach. Although such methods cannot profoundly
increase the size of networks that can be feasibly trained, they are nevertheless useful in
practice for reducing the amount of computation required for studying network learning.
The second approach we pursued is to develop networks whose architectures are
well-suited for specific tasks. This approach, an example of which is discussed in Section 6, is
not subject to Judd’s theorem because it involves structuring network architectures and
learning tasks instead of seeking a fast algorithm for arbitrary networks and tasks.
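
As a reminder of how momentum changes the step computation, the sketch below applies gradient descent with and without a momentum term to a simple quadratic performance measure. The test function, the rate and momentum values, and the function names are assumptions made for illustration; the update is the standard form in which a fraction of the previous weight change is added to the current gradient step.

```python
import math

def gradient_descent(grad, w, rate, momentum=0.0, steps=200):
    """Gradient descent in which each step adds a fraction of the last step."""
    w = list(w)
    change = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        change = [momentum * c - rate * gi for c, gi in zip(change, g)]
        w = [wi + ci for wi, ci in zip(w, change)]
    return w

def quadratic_grad(w):
    """Gradient of 0.5 * (w0**2 + 10 * w1**2), an elongated bowl."""
    return [w[0], 10.0 * w[1]]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

plain = gradient_descent(quadratic_grad, [1.0, 1.0], rate=0.01)
with_momentum = gradient_descent(quadratic_grad, [1.0, 1.0], rate=0.01, momentum=0.9)
```

With the same rate and the same number of steps, the run with momentum ends much closer to the minimum at the origin, because consistent gradient directions accumulate into larger effective steps along the shallow direction of the bowl.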

Methods  such  as  Newton’s  method,  recursive  least  squares,  and  conjugate-gradient 
methods  can  be  regarded  as  including  sophisticated  means  for  adjusting  learning  rate 
parameters during learning. However, these methods cannot be implemented by neuron-like
adaptive elements unless updating each weight uses information about the input
signals on all of the unit’s input pathways. That is, these methods are not as local as
the simpler methods used in most connectionist research. Extending these methods to
entire  networks  of  units  again  requires  the  use  of  information  that  is  not  likely  to  be 
locally  available  to  the  units  in  real  neural  networks  (or  would  be  difficult  to  supply  in 
VLSI  implementations  of  artificial  neural  networks).  We  therefore  focused  attention  on 
methods  that  conform  to  a  locality  constraint. 

R. A. Jacobs, a graduate student supported by the grant, examined local methods of
adaptively adjusting learning rate parameters that have been proposed in the engineering
literature,  developed  several  new  methods,  and  tested  these  methods  in  a  selection  of 
layered-network  learning  tasks.  He  summarized  his  findings  in  four  heuristics  that  provide 
guidelines  for  how  to  achieve  faster  rates  of  convergence  than  steepest  descent  techniques: 
first,  every  parameter  of  the  performance  measure  to  be  minimized  should  have  its  own 
individual  learning  rate;  second,  every  learning  rate  should  be  allowed  to  vary  over  time; 
third,  when  the  derivative  of  a  parameter  possesses  the  same  sign  for  several  consecutive 
time  steps,  the  learning  rate  for  that  parameter  should  be  increased;  fourth,  when  the 
sign  of  the  derivative  of  a  parameter  alternates  for  several  consecutive  time  steps,  the 
learning  rate  for  that  parameter  should  be  decreased. 


We studied one method for modifying rate parameters that we called the “delta-delta”
rule. This rule performs steepest descent on an error surface defined over learning
rate parameter space. If $\epsilon_i$ is the learning rate for the $i$th weight, then it is updated
according to:

$$\Delta\epsilon_i(t) = \gamma \, \frac{\partial J(t)}{\partial w_i(t)} \, \frac{\partial J(t-1)}{\partial w_i(t-1)},$$

where $J$ is the function of the weights that is being minimized by the network and $\gamma$ is
a parameter. This algorithm for updating the learning rates implements the heuristics
listed  above.  When  the  sign  of  the  derivative  of  a  weight  is  the  same  on  consecutive 
time  steps,  the  algorithm  increases  the  learning  rate  for  that  weight.  When  the  sign  of 
the  derivative  of  a  weight  alternates  on  consecutive  time  steps,  the  algorithm  decreases 
the  learning  rate  for  that  weight.  Unfortunately,  there  are  several  problems  with  this 
rule  that  limit  its  practical  use.  To  remedy  these  difficulties,  we  developed  a  related 
algorithm  called  the  “delta-bar-delta”  rule  that  is  a  bit  more  complicated  in  that  it  uses 
an  exponential  average  of  the  current  and  past  partial  derivatives. 
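
As an illustration only, the two rules can be sketched in Python. The delta-delta step follows the update equation given above. The delta-bar-delta step is an assumption on our part here: this report states only that the rule uses an exponential average of current and past derivatives, so the additive increase `kappa`, the multiplicative decrease `phi`, and the averaging constant `theta` follow the form usually given for the rule in the published paper by Jacobs. The quadratic test surface and all numeric values are hypothetical.

```python
def delta_delta_step(w, eps, grad, prev_grad, gamma):
    """One delta-delta update: adapt each learning rate, then step each weight."""
    # The rate grows when consecutive derivatives agree in sign and shrinks otherwise.
    new_eps = [e + gamma * g * pg for e, g, pg in zip(eps, grad, prev_grad)]
    new_w = [wi - e * g for wi, e, g in zip(w, new_eps, grad)]
    return new_w, new_eps


def delta_bar_delta_step(w, eps, grad, bar, kappa, phi, theta):
    """One delta-bar-delta update; `bar` is an exponential average of past derivatives."""
    new_w, new_eps, new_bar = [], [], []
    for wi, e, g, b in zip(w, eps, grad, bar):
        agreement = b * g
        if agreement > 0:        # consistent sign: additive increase (assumed form)
            e = e + kappa
        elif agreement < 0:      # alternating sign: multiplicative decrease (assumed form)
            e = e * (1.0 - phi)
        new_eps.append(e)
        new_w.append(wi - e * g)
        new_bar.append((1.0 - theta) * g + theta * b)
    return new_w, new_eps, new_bar


def run_delta_bar_delta(steps=200):
    """Minimize the hypothetical surface J(w) = 0.5 * sum(c_i * w_i**2)."""
    curvatures = [1.0, 10.0]
    w, eps, bar = [1.0, 1.0], [0.01, 0.01], [0.0, 0.0]
    for _ in range(steps):
        grad = [c * wi for c, wi in zip(curvatures, w)]
        w, eps, bar = delta_bar_delta_step(
            w, eps, grad, bar, kappa=0.005, phi=0.5, theta=0.7)
    return w, eps
```

Each weight keeps its own rate, each rate varies over time, and the sign agreement between the current derivative and the averaged past derivatives decides whether the rate grows or shrinks, which is how the four heuristics enter the sketch.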


We  compared  the  performance  of  several  rules  for  updating  learning  rate  parame¬ 
ters  by  applying  them  to  four  tasks.  These  tasks  were  the  optimization  of  quadratic 
surfaces,  and  the  learning  of  the  exclusive-or,  multiplexer,  and  binary-to-local  functions. 
These  tasks  were  chosen  because  their  error  surfaces  possess  a  variety  of  terrains.  The 
update  rules  tested  were:  steepest  descent,  momentum,  delta-bar-delta,  and  a  hybrid 
algorithm  that  combines  the  momentum  and  delta-bar-delta  procedures.  The  last  three 
tasks  require  the  use  of  multi-layer  networks.  For  all  algorithms,  the  back-propagation 
procedure  was  used  to  calculate  the  partial  derivative  of  the  error  with  respect  to  each 
weight.  The  simulation  results  provide  support  for  the  four  heuristics  for  how  to  achieve 
rates  of  convergence  substantially  faster  than  steepest  descent  algorithms  and  show  that 
the delta-bar-delta and hybrid methods substantially accelerate learning. A paper by Jacobs
describing these results was published in the October 1988 issue of Neural Networks [12].


Although  the  approach  to  accelerating  learning  using  procedures  of  this  kind  does 
yield  speed  increases,  it  is  not  likely  that  this  approach  will  make  a  significant  difference 
for  large  problems  (especially  in  light  of  Judd’s  theorem).  We  therefore  began  to  focus  on 
methods  for  accelerating  learning  that  we  believe  can  have  a  greater  impact  in  practical 
applications. These methods, described next, are based on structuring both the training
process and the networks.

Figure 5: A modular network consisting of two expert networks and a gating network.


6  Modular  Network  Architectures 

An  approach  to  improving  the  learning  ability  of  connectionist  systems  is  to  organize 
several  networks  into  modular  architectures.  One  advantage  of  such  a  structure  is  that 
individual networks are not faced with solving a large problem in its entirety. Large problems
are solved by the combined efforts of several networks. This requires that a problem
be  broken  into  subproblems,  and  subproblems  into  subsubproblems,  etc.  We  developed 
a  learning  method  for  a  modular  architecture  consisting  of  several  networks,  which  we 
call  “expert  networks”,  specialized  for  different  kinds  of  tasks,  and  a  “gating  network” 
that  learns  how  to  switch  in  the  best  expert  network  for  a  particular  subtask.  The  expert 
networks  compete  to  learn  about  training  patterns.  Through  such  competition,  different 
expert  networks  are  allocated  to  learn  different  functions.  This  approach  can  accelerate 
the  learning  process  if  the  architectures  of  the  expert  networks  are  designed  based  on 
some  prior  knowledge  of  subtasks,  and  it  can  permit  the  modular  network  to  efficiently 
learn to perform multiple tasks by allocating different expert networks for each task.

Consider  the  architecture  illustrated  in  Figure  5.  It  contains  two  types  of  networks. 
The expert networks compete to learn and perform training patterns. The gating network
mediates this competition. After training, expert networks 1 and 2 compute different
functions that are useful in different regions of the domain. Let the output of these
networks be labeled $E_1$ and $E_2$ respectively. The gating network is an administrative
agency that decides whether expert network 1 or 2 is currently applicable. This network
contains two output units labeled $g_1$ and $g_2$ respectively. The output of the system, $O$,
equals $g_1 E_1 + g_2 E_2$. Therefore, when $g_1 = 1$ and $g_2 = 0$, expert network 1 determines the
output of the system. Similarly, when $g_1 = 0$ and $g_2 = 1$, expert network 2 determines
the output of the system.
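
The gated combination just described can be sketched in a few lines of Python. This is an illustration only: the function name and the use of plain lists for output vectors are assumptions, and the sketch generalizes the two-expert sum to any number of experts.

```python
def system_output(expert_outputs, gates):
    """Return O = sum_i g_i * E_i, computed component by component.

    expert_outputs: list of expert output vectors E_i (equal lengths).
    gates: list of gating-network outputs g_i, one per expert.
    """
    dim = len(expert_outputs[0])
    return [sum(g * e[k] for g, e in zip(gates, expert_outputs))
            for k in range(dim)]
```

With gates `[1.0, 0.0]` the system output equals the first expert's output, and with `[0.0, 1.0]` it equals the second expert's output. Intermediate gate values blend the two.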

During training, all networks modify their weights simultaneously using the back-propagation
algorithm [30]. However, the expert and gating networks attempt to minimize
different error functions. At each time step, the expert networks attempt to minimize
the sum of squared error between the output of the system, $O$, and the desired
output, $O^*$. This error function is written

$$J_O = \frac{1}{2} (O^* - O)^T (O^* - O). \qquad (1)$$

The  gating  network  attempts  to  minimize  a  more  complicated  error  function.  The 
intuition  behind  this  function  is  as  follows.  For  each  training  pattern,  one  expert  network 
comes  closer  to  producing  the  desired  output  than  the  other  expert  networks.  In  the 
competition  among  networks,  this  one  is  called  the  winner  and  all  others  are  losers. 
Suppose  that  on  this  training  pattern,  the  system’s  performance  is  significantly  better 
than  it  has  been  in  the  past.  In  this  case,  the  output  of  the  gating  network  corresponding 
to  the  winning  expert  network  is  increased  towards  one  and  the  outputs  corresponding 
to  the  losing  expert  networks  are  decreased  toward  zero.  Alternatively,  if  the  system’s 
performance  has  not  improved,  then  all  outputs  of  the  gating  network  are  moved  toward 
a  neutral  value. 

Mathematically,  this  intuition  is  expressed  as  follows.  First,  we  determine  if  the 
system’s  performance  is  significantly  better  than  it  has  been  in  the  past.  If  t  is  the 
current time step, then the error $J_O(t)$ is a measure of the current performance. The
measure of the system’s past performance is the exponential average over time of $J_O$.
This value, labeled $\bar{J}_O$, is computed by

$$\bar{J}_O(t) = \alpha J_O(t) + (1 - \alpha) \bar{J}_O(t - 1). \qquad (2)$$

We use the binary variables $\lambda_{WTA}$ (WTA stands for winner-take-all) and $\lambda_{NT}$ (NT stands
for neutral) to indicate whether the system’s performance has significantly improved.
Specifically,

If $J_O(t) < \gamma \bar{J}_O(t - 1)$ (3)

Then $\lambda_{WTA} = 1$ and $\lambda_{NT} = 0$.

Else $\lambda_{WTA} = 0$ and $\lambda_{NT} = 1$.

Suppose  that  the  system's  performance  has  significantly  improved.  In  this  case,  we 
determine  which  expert  network’s  output  is  closest  to  the  desired  output.  Define  the 
error  for  expert  network  i  to  be  the  sum  of  squared  error  between  the  network's  output 
$E_i$ and the desired output $O^*$. This value, labeled $J_{E_i}$, is written

$$J_{E_i} = \frac{1}{2} (O^* - E_i)^T (O^* - E_i). \qquad (4)$$

The winning expert network, labeled $w$, is the network with the smallest error. All
other expert networks are losers. The desired value of the output of the gating network
corresponding to the winning expert network, labeled $g_w^*$, is set to 1. The desired values
of the outputs corresponding to the losing expert networks, labeled $g_l^*$, are set to 0.
Alternatively, if the system’s performance has not significantly improved, then the desired
values for all outputs of the gating network are set to a neutral value. This value is $1/n$,
where $n$ is the number of expert networks.
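
The procedure for choosing the gating network's desired outputs can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, and the multiplicative `threshold` stands in for the factor applied to the averaged past error in condition (3), whose exact form we are assuming.

```python
def squared_error(desired, output):
    """Half the sum of squared differences, as in the expert error J_{E_i}."""
    return 0.5 * sum((d - o) ** 2 for d, o in zip(desired, output))


def update_average(j_bar_prev, j_now, alpha):
    """Exponential average of the system error, as in equation (2)."""
    return alpha * j_now + (1.0 - alpha) * j_bar_prev


def gating_targets(desired, expert_outputs, j_now, j_bar_prev, threshold):
    """Return (lambda_wta, lambda_nt, targets) for one training pattern."""
    n = len(expert_outputs)
    if j_now < threshold * j_bar_prev:
        # Performance improved significantly: winner-take-all targets.
        errors = [squared_error(desired, e) for e in expert_outputs]
        winner = errors.index(min(errors))
        targets = [1.0 if i == winner else 0.0 for i in range(n)]
        return 1, 0, targets
    # No significant improvement: every target is the neutral value 1/n.
    return 0, 1, [1.0 / n] * n
```

In use, the caller would compute the current system error, call `gating_targets` with the previous averaged error, and then refresh the average with `update_average`.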

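The gating network’s error function is:

$$J_G = \lambda_{WTA} \frac{1}{2} \sum_{i=1}^{n} (g_i^* - g_i)^2 + \lambda_{WTA} \frac{1}{2} \Bigl(1 - \sum_{i=1}^{n} g_i\Bigr)^2 + \lambda_{WTA} \sum_{i=1}^{n} g_i (1 - g_i) + \lambda_{NT} \frac{1}{2} \sum_{i=1}^{n} \Bigl(\frac{1}{n} - g_i\Bigr)^2. \qquad (5)$$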

Only  the  first  three  terms  contribute  to  the  error  when  the  system’s  performance  has 
significantly  improved.  Otherwise,  only  the  fourth  term  contributes  to  the  error.  The 
first  term  is  the  sum  of  squared  error  between  the  desired  outputs  and  the  actual  outputs 
of  the  gating  network.  Minimization  of  the  second  term  occurs  when  the  outputs  of  the 
gating  network  sum  to  one.  Minimization  of  the  third  term  occurs  when  the  outputs 
of  the  gating  network  are  binary  valued.  The  effect  of  minimizing  the  second  and  third 
terms  is  that,  in  response  to  each  input  pattern,  one  output  of  the  gating  network  equals 
one  and  all  others  equal  zero.  The  fourth  term  is  the  sum  of  squared  error  between  the 
neutral  value  and  the  actual  outputs  of  the  gating  network. 
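
The four terms described above can be written out directly. The Python sketch below is illustrative only: the function name is hypothetical, and the constant factors (one half on the squared-error terms, none on the binary-valued term) are assumptions rather than values confirmed by the text.

```python
def gating_error(g, targets, lambda_wta, lambda_nt):
    """Four-term gating error for one pattern.

    g: actual gating outputs; targets: desired gating outputs;
    lambda_wta, lambda_nt: the binary winner-take-all and neutral indicators.
    """
    n = len(g)
    # Term 1: squared error between desired and actual gating outputs.
    match = 0.5 * sum((t - gi) ** 2 for t, gi in zip(targets, g))
    # Term 2: zero when the gating outputs sum to one.
    sum_to_one = 0.5 * (1.0 - sum(g)) ** 2
    # Term 3: zero when every gating output is 0 or 1.
    binary = sum(gi * (1.0 - gi) for gi in g)
    # Term 4: squared error between the neutral value 1/n and the outputs.
    neutral = 0.5 * sum((1.0 / n - gi) ** 2 for gi in g)
    return lambda_wta * (match + sum_to_one + binary) + lambda_nt * neutral
```

For two experts, gating outputs of `[1.0, 0.0]` with matching targets give zero error in the winner-take-all case, and outputs of `[0.5, 0.5]` give zero error in the neutral case.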

The  gating  network  determines  how  much  each  expert  network  learns  about  each 
training  pattern.  Referring  to  Figure  5,  note  that  the  error  vector  back-propagated 
into expert network 1 is $g_1 (O^* - O)$ and the error vector back-propagated into expert
network 2 is $g_2 (O^* - O)$. Thus, the gating network determines the magnitudes of the
expert  networks’  error  vectors. 

Several  investigators  have  noted  that  the  selection  of  a  network’s  topology  is  extremely 
important  since  the  topology  determines  what  functions  the  network  can  readily  learn 
and  what  functions  it  can  only  learn  with  great  difficulty,  if  at  all.  Furthermore,  the 
topology  also  influences  a  network's  ability  to  generalize.  Frequently,  an  experimenter 
can  use  domain  knowledge  to  select  a  set  of  expert  network  topologies  that  are  potentially 
useful  for  rapidly  learning  the  tasks  faced  by  the  architecture.  An  advantage  of  requiring 
the expert networks to compete to learn and perform training patterns is that the network
whose  topology  most  facilitates  the  learning  of  the  function  that  generates  the  current 
training  patterns  is  likely  to  win  the  competition.  Thus,  our  architecture  tends  to  allocate 
to  each  function  an  expert  network  with  a  topology  that  is  appropriate  to  that  function. 

We tested this architecture on a simple vision task and on a robotics task. The vision
task  was  proposed  by  Rueckl,  Cave,  and  Kosslyn  [29]  who  compared  the  performance 
of  two  connectionist  systems  on  an  object  recognition  task  (henceforth,  referred  to  as 
the  “what”  task)  and  a  spatial  localization  task  (henceforth,  referred  to  as  the  “where” 
task).  In  their  study,  the  first  system  is  a  single  network  which  is  required  to  learn  both 
tasks.  The  second  system,  on  the  other  hand,  consists  of  two  networks,  one  for  each 
task.  During  training  of  the  systems,  one  of  nine  patterns  was  placed  at  one  of  nine 
locations on a 5 × 5 matrix. The “what” task is to identify the pattern. The “where”
task  is  to  identify  the  spatial  location.  Rueckl,  Cave,  and  Kosslyn  [29]  report  that  the 
second  system  is  superior  to  the  first  system  in  the  sense  that  it  learns  the  tasks  faster 
and  develops  a  more  interpretable  representation. 

An  issue  that  Rueckl,  Cave,  and  Kosslyn  did  not  address,  and  the  issue  with  which 
we  were  primarily  concerned,  is  the  development  of  a  system  that  can  learn  if  it  is  better 
to  perform  two  or  more  tasks  in  distinct  networks  and,  if  so,  can  itself  allocate  distinct 
networks to learn each task. Such a system would have the ability to learn how to partition
a task into subtasks and allocate these tasks to expert networks. Simulation results
demonstrate that the architecture and learning rule we developed learn to allocate distinct
expert networks to the “what” and “where” tasks. Furthermore, the architecture
tends to allocate a single-layer network to the “where” task (this task is linearly separable)
and a multi-layer network to the “what” task (this task is not linearly separable).
Thus,  these  results  suggest  that  the  architecture  learns  to  allocate  to  each  task  an  expert 
network  with  a  topology  that  is  appropriate  to  that  task. 

The  robotics  task  on  which  we  tested  this  modular  architecture  is  the  task  of  learning 
to  control  a  robot  arm  to  move  a  variety  of  payloads,  each  with  a  different  mass,  along 
a desired trajectory. The architecture was successfully trained to serve as a feedforward
controller for the robot arm using a training technique previously used by Kawato,
Furukawa,  and  Suzuki  [21]  and  Miller  [26].  During  training,  the  architecture  learns  to 
allocate  one  expert  network  to  control  the  arm  with  no  payload,  a  second  expert  network 
to  control  the  arm  with  a  light  payload,  and  a  third  expert  network  to  control  the  arm 
with  a  heavy  payload. 

We also trained a modification of the modular architecture to perform this trajectory-following
task. This modified architecture includes a “share network” whose output
contributes  to  the  output  of  the  system  at  all  times.  During  training,  the  share  network 
learns  to  control  the  arm  with  no  payload  and  the  expert  networks  learn  to  supply  extra 
torques  in  order  to  compensate  for  the  mass  of  each  payload.  In  this  sense,  the  modified 
architecture  learns  to  solve  a  task  by  learning  a  shared  strategy  that  is  used  in  all  contexts 
along with a set of modifications to this strategy that are applied in a context-sensitive
manner.

A  preliminary  discussion  of  this  approach  to  modular  architectures  appeared  as 
ref.  [13],  and  Jacobs  is  currently  writing  a  Ph.D.  dissertation  on  this  topic  which  we 
expect  to  be  completed  in  the  Spring  of  1990. 

7 Reinforcement Learning for Control of Dynamical Systems


In research funded by previous AFOSR grants, we applied adaptive networks to the “pole-balancing”
task in which the network was required to learn how to prevent a pole from
falling by exerting appropriate control actions [8, 32, 1]. Although we learned a lot from
that  research,  its  character  was  too  heuristic  to  appeal  directly  to  the  adaptive  control 
engineering  community.  We  began  the  development  of  a  more  rigorous  view  of  the  type  of 
method  implemented  by  the  pole-balancing  controller.  This  effort  has  resulted  in  major 
insights  into  relationships  between  reinforcement  learning  methods  for  control  and  more 
orthodox  engineering  methods  and,  consequently,  better  understanding  of  what  may  be 
the  strengths  and  weaknesses  of  connectionist  reinforcement  learning.  In  particular,  it 
has become clear that the most relevant existing mathematical framework is the theory
of stochastic sequential decision problems and the most relevant computational methods are
those  of  stochastic  dynamic  programming.  We  have  studied  these  connections  through 
interaction  with  C.  Watkins,  Philips  Research  Laboratories,  whose  Ph.D.  dissertation 
[35]  develops  this  connection,  discussions  with  P.  J.  Werbos,  of  The  National  Science 
Foundation,  who  began  exploring  these  connections  in  the  mid-1970s  [36,  37],  as  well  as 
continuing  interaction  with  R.  S.  Sutton,  of  GTE  Laboratories,  Inc.  These  connections, 
which  are  briefly  outlined  here,  are  discussed  in  detail  by  Barto,  Sutton,  and  Watkins  [9]. 
There  is  a  huge  literature  on  sequential  decision  problems  and  dynamic  programming.  A 
relatively  recent  and  concise  account  is  provided  by  Ross  [28]. 

Stochastic  sequential  decision  problems  involve  a  decision-making  system  (let  us  call 
it the Decision Maker, or DM) interacting with a dynamical system in such a way that
at the beginning of each of a series of discrete time periods, the DM observes the system’s
current state. Based on the observed state, the DM selects an action that will
influence  the  system’s  behavior.  After  the  action  is  performed,  the  DM  receives  a  certain 
amount  of  payoff  that  depends  on  both  the  current  system  state  and  the  action,  and  the 
system  undergoes  a  state  transition  determined  by  its  current  state,  the  action  that  was 
performed,  and  random  disturbances.  Upon  observing  the  new  state,  the  DM  chooses 
another  action  and  continues  in  this  manner  for  a  sequence  of  time  periods.  The  task 
of  the  DM  is  to  form  a  rule  for  selecting  actions,  called  a  policy,  that  maximizes  the 
expected  value  of  the  sum  of  the  payoff  earned  over  future  time  periods.  The  return 
of  a  policy  refers  to  the  sum  of  payoff  received  over  time  by  a  DM  using  that  policy. 
The  objective  is  therefore  to  form  a  policy  that  maximizes  the  expected  return.  These 
tasks are specific types of discrete-time control tasks where the policy corresponds to a
state-feedback  control  law. 

The  number  of  time  periods,  each  corresponding  to  the  selection  and  performance 
of a single action, over which the return of a policy is determined is the horizon of the
decision problem. It is usual to distinguish problems according to whether the horizon
is finite or infinite. In finite-horizon problems, one desires a policy that maximizes the
expected return over a given finite number of time periods. In infinite-horizon problems,
one desires a policy that would maximize the expected return over an infinite number of
time periods. A discount factor is often used to weight payoff values so that the farther
in  the  future  a  payoff  is  expected  to  occur,  the  less  it  contributes  to  the  sum  that  is  to 
be  maximized.  In  this  case,  a  policy’s  return  is  a  weighted  sum  of  the  payoff  values  that 
the  policy  will  produce  over  future  time  periods,  where  each  weight  depends  both  on  the 
discount  factor  and  when  the  payoff  is  received.  The  infinite-horizon  discounted  case  is 
particularly  interesting  from  a  mathematical  point  of  view  and  is  the  case  to  which  our 
reinforcement learning methods are most closely related.
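
A short Python sketch may make the discounted return concrete. It assumes the common convention in which a payoff received k periods in the future is weighted by the discount factor raised to the power k. The function name and the finite payoff list are illustrative assumptions.

```python
def discounted_return(payoffs, discount):
    """Weighted sum of payoffs, where payoffs[k] is weighted by discount**k."""
    total = 0.0
    # Work backwards so each earlier payoff adds the discounted later total.
    for payoff in reversed(payoffs):
        total = payoff + discount * total
    return total
```

For example, three payoffs of 1 with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75, while a discount factor of 1 gives the undiscounted sum 3.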

The  return  expected  over  the  future  depends  on  the  discount  factor,  the  current  state 
of  the  system,  and  the  policy  the  DM  will  use  over  the  future.  The  evaluation  function 
for  a  given  policy  and  discount  factor  assigns  to  each  state  the  expected  discounted  return 
given  that  the  decision  problem  begins  in  that  state  and  the  DM  uses  the  given  policy 
over  the  entire  future.  The  objective  of  the  decision  task  is  to  find  a  policy  (there  may 
be  many)  such  that,  for  a  given  discount  factor,  the  corresponding  evaluation  function 
takes on values that are as large as possible. Such a policy is an optimal policy, and the
evaluation function corresponding to it is the optimal evaluation function, which is unique
for a given discount factor.

Because so many problems of practical interest can be formulated as stochastic sequential
decision problems, there is an extensive literature devoted to the study of solution
methods for this type of problem, the large majority of which require the decision maker
to have a complete model of the dynamical system underlying the decision problem.
Aside from extreme brute-force search methods, dynamic programming (DP) methods
provide the only methods for solving these problems in the general case of nonlinear systems.
Stochastic dynamic programming methods apply to the stochastic sequential decision
problems described above.

For  finite-horizon  problems,  DP  techniques  work  by  computing  backwards  from  the 
end  of  a  problem  to  its  beginning,  calculating  information  pertinent  to  decision  making 
at  each  step  based  on  information  previously  calculated  from  that  step  to  the  problem’s 
end.  In  the  stochastic  case,  if  there  is  one  step  remaining  in  the  task,  the  expected 
return for each possible action can be computed on the basis of the knowledge (assumed
to be available) about the system state transitions and payoff probabilities. Thus, for
each  state-action  pair,  one  computes  the  payoff  expected  in  one  step,  i.e.,  the  expected 
one-step  return.  Any  optimal  decision  policy  must  select  the  action  that  maximizes 
this  expected  one-step  return  when  there  is  one  step  remaining  in  the  decision  problem. 
Then,  given  that  we  know  the  maximal  expected  return  from  each  state  for  a  one-step 
problem  (which  we  have  just  computed),  we  can  compute  the  expected  two-step  return 
for  each  state-action  pair  by  treating  the  two-step  problem  as  a  one-step  problem  where 
the  expected  return  is  the  expected  immediate  payoff  on  the  first  step  plus  the  expected 
return  for  one  more  step — which  is  the  quantity  already  computed.  The  optimal  decision 
for  the  penultimate  step  of  the  problem  selects  the  action  that  maximizes  this  expected 
two-step  return.  This  process  repeats  until  the  entire  optimal  decision  policy  is  specified. 
If  the  problem  has  an  infinite  horizon,  this  iterative  method  can  be  modified  slightly  so 
that  it  successively  approximates  the  infinite-horizon  case. 
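
The backward computation described above can be sketched for a small tabular problem. This Python example is illustrative only: it assumes a finite set of states and actions, a known transition table `P`, a known expected-payoff table `R`, and no discounting. All names are hypothetical.

```python
def backward_dp(P, R, horizon):
    """Finite-horizon stochastic DP, computed from the end of the problem backwards.

    P[a][s][t]: probability of moving from state s to state t under action a.
    R[s][a]: expected immediate payoff for action a in state s.
    Returns (values, policy): values[s] is the maximal expected return from s
    over `horizon` steps, and policy[k][s] is an optimal action in state s
    when k + 1 steps remain.
    """
    n_actions, n_states = len(P), len(R)
    values = [0.0] * n_states          # nothing can be earned with zero steps left
    policy = []
    for _ in range(horizon):
        # Expected return of each state-action pair: immediate payoff plus the
        # expected value already computed for the remaining steps.
        q = [[R[s][a] + sum(P[a][s][t] * values[t] for t in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        policy.append([row.index(max(row)) for row in q])
        values = [max(row) for row in q]
    return values, policy
```

Each pass of the loop treats the problem with one more step remaining as a one-step problem whose continuation values are already known, which is the reuse of previously calculated information that the text describes.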

Stochastic DP scales very poorly to large problems. As the number of process
states, actions, and steps in the decision problem increases, the amount of computation
required quickly becomes prohibitive. Consequently, the problem of forming estimates of
optimal evaluation functions and optimal policies without performing all of this computation
has great practical significance. The reinforcement learning methods that we have
studied  can  be  seen  as  such  approximation  methods  that  have  the  additional  property  of 
being  applicable  when  complete  knowledge  of  the  dynamical  system  underlying  a  decision 
task  is  absent. 

When  a  complete  model  of  the  dynamical  system  underlying  a  sequential  decision 
task  is  not  available,  it  is  necessary  to  learn  about  the  system  while  interacting  with 
it.  One  approach  is  to  construct  a  model  of  the  system  underlying  the  decision  problem 
in  the  form  of  estimates  of  state-transition  and  payoff  probabilities  and  then  apply  DP 
methods under the assumption that the system model is accurate (e.g., refs. [24, 25, 31]).
In  the  nonlinear  case,  these  methods  scale  very  poorly  because  they  require  repeated 
application  of  DP  methods.  Another  approach  is  to  directly  adjust  the  decision  policy 
as  a  result  of  observed  consequences  of  the  decisions  it  specifies.  Here,  the  DM  tries 
out  a  variety  of  decisions,  observes  their  consequences,  and  directly  adjusts  its  policy  in 
order  to  improve  it.  It  is  possible  to  facilitate  this  direct  learning  of  a  decision  policy  by 
combining  it  with  a  process  for  estimating  an  evaluation  function  so  that  the  long-term 
consequences  of  actions  are  reflected  in  evaluations  that  are  available  immediately  after 
an  action  is  performed.  This  is  the  approach  we  took  in  the  pole-balancing  system  [8], 
where  the  “Associative  Search  Element”  adjusted  the  policy  and  the  “Adaptive  Critic 
Element”  estimated  the  evaluation  function  corresponding  to  the  evolving  policy.  In 
fact,  as  discussed  in  ref.  [9],  the  learning  rule  used  by  the  Adaptive  Critic  Element  can  be 
understood  in  terms  of  a  functional  equation  from  DP.  Although  this  is  a  “model-free” 
approach  to  learning  a  decision  policy,  it  does  not  preclude  the  additional  use  of  system 
models.  Methods  combining  model-free  techniques  with  model-based  methods  will  be  a 
major  emphasis  of  future  research. 
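
To illustrate how an evaluation function can be estimated from experience without a system model, the following Python sketch applies a generic tabular temporal-difference update. It is an assumption-laden illustration and not the Adaptive Critic Element's learning rule as specified in ref. [8]: the table representation, the step size, and the two-state example chain are all hypothetical.

```python
def td_update(values, state, payoff, next_state, discount, step_size):
    """Move values[state] toward payoff + discount * values[next_state]."""
    target = payoff + discount * values[next_state]
    values[state] += step_size * (target - values[state])
    return values


def evaluate_chain(sweeps=2000, discount=0.5, step_size=0.1):
    """Estimate evaluations for a hypothetical two-state chain under a fixed policy.

    State 0 moves to state 1 with payoff 0; state 1 stays in state 1 with payoff 1.
    """
    values = [0.0, 0.0]
    for _ in range(sweeps):
        td_update(values, 0, 0.0, 1, discount, step_size)
        td_update(values, 1, 1.0, 1, discount, step_size)
    return values
```

With a discount factor of 0.5, the exact evaluations for this chain are 2 for state 1 and 1 for state 0, and the estimates approach those values using only observed transitions and payoffs.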

Within the framework of sequential decision problems and DP methods, the connectionist
reinforcement learning methods we have studied are best viewed as Monte Carlo
methods for approximating the results of stochastic DP methods and are applicable when
there is no complete model of the dynamical system underlying the decision task. Instead
of computing optimal policies and evaluation functions using a system model and
DP methods, these functions are directly approximated by connectionist networks on
the basis of sequences of trials with the decision task. Understanding these relationships
to existing theories has greatly contributed to our goal of establishing connectionist
reinforcement learning methods as rigorously defensible approaches to learning control
applicable to complex nonlinear control tasks. As pointed out by Werbos [37], coupling
connectionist  function  approximation  techniques  to  Monte  Carlo  DP  provides  a  means 
for  bringing  connectionist  algorithms  and  hardware  to  bear  on  sequential  decision  tasks 
that  are  too  large  and  involve  too  much  uncertainty  to  permit  solution  by  existing  exact 
methods. 

8 Imperfect State Information


Methods  for  solving  sequential  decision  problems  based  on  dynamic  programming  such 
as  those  discussed  in  Section  7  rely  on  having  access  to  the  state  of  the  dynamical 
system  underlying  the  decision  problem.  When  this  information  is  absent,  additional 
methods  must  be  used  to  provide  estimates  of  the  current  state  of  the  system.  Additional 
complexities  arise  if  the  problem  is  not  just  to  estimate  the  state  of  a  known  dynamical 
system  but  to  construct  a  dynamical  model  based  on  observable  information  whose  states 
are  to  provide  input  to  the  decision-making  process.  Although  there  exists  a  large  body 
of  literature  on  state  estimation  and  system  identification,  we  have  pursued  some  ideas 
that  are  not  easily  placed  within  the  spectrum  of  traditional  methods,  having  stronger 
ties  to  grammatical  inference  than  to  traditional  engineering  methods.  Below  is  a  brief 
description  of  this  research  which  makes  up  part  of  the  Ph.D.  research  of  J.  Bachrach,  one 
of  the  graduate  students  supported  by  the  grant,  who  is  currently  writing  a  dissertation 
on  this  topic  expected  to  be  completed  in  the  Spring  or  Summer  of  1990. 

This  work  began  with  an  investigation  of  methods  for  training  simple  reverberatory 
circuits  to  act  as  memory  devices.  For  example,  a  connectionist  unit  that  excites  itself 
through  a  recurrent  connection  can  be  “set”  or  “reset”  like  an  SR  flipflop.  This  kind 
of  memory  is  different  from  the  kind  of  long-term  memory  that  is  stored  in  connection 
weights.  The  problem  is  to  learn  when  to  set  or  reset  these  bits  in  a  variety  of  paradigms. 
This  kind  of  knowledge  would  be  stored  in  connection  weights.  Early  approaches  to  this 
type  of  problem  led  to  an  investigation  of  research  being  conducted  by  Schapire  and  Rivest 
of  the  MIT  Laboratory  for  Computer  Science,  who  designed  an  algorithm  for  constructing 
a model of a finite-state environment through exploration [27]. We have developed a
connectionist network based on the representation of finite-state automata used in the
Rivest-Schapire (RS) algorithm. This representation has a natural, direct connectionist
implementation and is able to strongly constrain the network’s architecture. Although
the network explores the environment in the simplest possible way, by choosing random
actions, for simple environments the network can outperform the Rivest and Schapire
algorithm  because  it  is  able  to  consider  many  hypotheses  in  parallel.  The  network  has 
the  additional  strength  that  it  is  applicable  to  nondeterministic  environments. 
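
The set/reset behavior of a self-exciting unit can be illustrated with a minimal sketch. The weights and threshold below are hand-set, illustrative values that do not come from the report; the learning problem described above is precisely that of finding such weights automatically. The function names `step` and `run` are hypothetical.

```python
def step(state, set_input, reset_input,
         w_self=1.0, w_set=1.0, w_reset=-2.0, threshold=0.5):
    """One update of a binary threshold unit with a recurrent self-connection.

    All weights are hand-set assumptions chosen so that the unit latches.
    """
    net = w_self * state + w_set * set_input + w_reset * reset_input
    return 1 if net > threshold else 0


def run(inputs, state=0):
    """Apply a sequence of (set, reset) input pairs and record the unit's state."""
    trace = []
    for set_input, reset_input in inputs:
        state = step(state, set_input, reset_input)
        trace.append(state)
    return trace


# The unit stays off, is set, holds its value through self-excitation, then is reset.
print(run([(0, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]))
```

The bit held by the unit lives in its activity, whereas the knowledge of when to set or reset it would live in the weights on the set and reset inputs.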

As  a  simple  example,  consider  an  environment  consisting  of  n  rooms  arranged  in  a 
circle,  with  a  light  and  light  switch  in  each  room.  In  a  given  room,  the  decision  maker 
can  take  one  of  three  actions:  move  to  the  room  on  the  left,  move  to  the  room  on  the 
right,  and  toggle  the  light  switch  in  the  current  room.  The  decision  maker  can  sense  the 
state  of  the  light  in  the  current  room  (on  or  off).  This  environment  can  be  modeled  in 
the obvious way by a finite-state automaton (FSA) having 2^n states. Although one could
try learning an unstructured representation of this FSA by estimating its state transition
function, it often is not efficient to do so because the unstructured FSA representation
does not capture redundancy inherent in the environment. For example, in this n-room
environment, although the sensation resulting from toggling the light switch is dependent
on only the state of the current room, in the unstructured FSA representation, knowledge
about “toggle” must be encoded for each of the 2^n distinct states. The environment has
symmetries not represented in the unstructured FSA representation. Rather than trying
to learn the FSA in unstructured form, Rivest and Schapire suggest learning another
representation called an update graph. The advantage of the update graph representation
is that in environments with many regularities, the number of nodes in the update graph
can be much less than the number of states of the FSA (e.g., 2n versus 2^n for the n-room
world).  The  update  graph  is  a  particular  structured  representation  of  the  FSA  in  which 
each  state  is  represented  by  a  pattern  of  activity  across  the  nodes  of  the  graph.  In  other 
words,  the  update  graph  provides  a  particular  distributed  representation  of  environmental 
states. 
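
The gap between the two representations can be checked with a small sketch that enumerates the distinguishable states of the n-room world. It assumes a state is described by the light values listed from the current room around the circle, and that "right" means the next room in that listing; both are conventions chosen here for illustration, and `count_fsa_states` is a hypothetical name.

```python
def count_fsa_states(n):
    """Count states of the n-room world reachable from the all-lights-off state.

    A state is the tuple of light values read from the current room around the circle.
    """
    start = (0,) * n
    seen = {start}
    frontier = [start]
    while frontier:
        s = frontier.pop()
        successors = (
            s[1:] + s[:1],        # move right: the neighbouring room becomes current
            s[-1:] + s[:-1],      # move left
            (1 - s[0],) + s[1:],  # toggle the light in the current room
        )
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


# Columns: n, number of FSA states, number of update-graph nodes (2n).
for n in (3, 5, 8):
    print(n, count_fsa_states(n), 2 * n)
```

The state count doubles with each added room, while the update graph grows by only two nodes.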

The  update  graph  representation  is  based  on  the  notion  of  a  test.  A  test  consists 
of  a  sequence  of  zero  or  more  actions  followed  by  the  application  of  a  predicate  that  is 
true  for  a  particular  sensation.  A  test  is  performed  by  executing  the  sequence  of  actions 
from  the  current  environmental  state  and  then  checking  for  the  presence  or  absence  of 
the sensation. Certain tests always yield the same truth value as one another, regardless of
the current environmental state. For example, toggling the light switch four times has
exactly the same effect as toggling the switch two or zero times. Such tests are equivalent,
and there is a node in the update graph representation for each equivalence class of tests.
Each directed arc of the update graph is labeled with an action. There is an arc directed
from node α to node β labeled action if the test that results from executing action followed
by any test in the equivalence class β is in the equivalence class represented
by node α. Associated with each node is a binary variable giving the truth value of
the corresponding test given the current environmental state. If the current values of
all nodes are known, then the values after executing an action can be inferred from the
update graph: the value of node β following action is equal to the current value of the
node α connected to β by the link labeled action. Thus, the sensations obtained after
performing  a  sequence  of  actions  can  be  predicted  simply  by  shifting  values  around  in 
the  update  graph.  The  update  graph  serves  as  a  structured  model  of  the  environment. 
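
The shifting of values described above can be sketched for the n-room world. This is an illustrative hand-built update graph, not the RS learning algorithm: the node naming (k, v), meaning "the light k rooms to the right has value v", and the function names are assumptions introduced here. The check compares the graph's predicted sensation with a directly simulated environment under random actions.

```python
import random


def build_update_graph(n):
    """Return the 2n nodes and, for each action, a map from node beta to its source alpha."""
    nodes = [(k, v) for k in range(n) for v in (0, 1)]
    source = {"left": {}, "right": {}, "toggle": {}}
    for k, v in nodes:
        source["right"][(k, v)] = ((k + 1) % n, v)
        source["left"][(k, v)] = ((k - 1) % n, v)
        # Toggling swaps the two tests on the current room and leaves the others alone.
        source["toggle"][(k, v)] = (k, 1 - v) if k == 0 else (k, v)
    return nodes, source


def shift(values, source, action):
    """New value of each node beta is the old value of the node alpha feeding it."""
    return {beta: values[alpha] for beta, alpha in source[action].items()}


def predictions_match(n=3, steps=1000, seed=0):
    rng = random.Random(seed)
    lights = [rng.randint(0, 1) for _ in range(n)]
    pos = 0
    nodes, source = build_update_graph(n)
    # Assumes the initial node values are known.
    values = {(k, v): lights[(pos + k) % n] == v for k, v in nodes}
    for _ in range(steps):
        action = rng.choice(["left", "right", "toggle"])
        if action == "right":
            pos = (pos + 1) % n
        elif action == "left":
            pos = (pos - 1) % n
        else:
            lights[pos] ^= 1
        values = shift(values, source, action)
        if values[(0, 1)] != (lights[pos] == 1):
            return False
    return True


print(predictions_match(3), predictions_match(5, seed=1))
```

Node (0, 1) is the test with no actions and the predicate "light is on", so its value is the predicted sensation.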

Following this work of Schapire and Rivest [27], Bachrach, in collaboration with M.
Mozer of the University of Colorado, devised a network architecture that learns to perform
as  an  update  graph  (Figure  6).  Each  unit  in  the  network  corresponds  to  a  node  in 
the  update  graph.  The  binary-valued  activity  of  a  unit  corresponds  to  the  truth  value 
of  a  node.  Connections  between  units  are  gated  by  a  set  of  gating  units  such  that 
the  connection  is  enabled  only  if  the  given  action  is  performed  by  the  organism;  this 
corresponds  to  the  labeled  links  between  nodes  of  the  update  graph. 
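
One way to picture the gating is as a separate weight matrix per action, with the performed action selecting which matrix drives the units. The sketch below is an assumption-laden illustration for the n-room world: the weights are hand-set to 0 or 1 rather than learned, the unit indexing and threshold are choices made here, and `gated_weights` and `forward` are hypothetical names, not the architecture of Figure 6 itself.

```python
ACTIONS = ("left", "right", "toggle")


def gated_weights(n):
    """Hand-set weights: W[action][i][j] = 1 if unit j feeds unit i under that action."""
    size = 2 * n

    def idx(k, v):
        # Unit for the test "light k rooms to the right has value v".
        return 2 * (k % n) + v

    W = {a: [[0.0] * size for _ in range(size)] for a in ACTIONS}
    for k in range(n):
        for v in (0, 1):
            i = idx(k, v)
            W["right"][i][idx(k + 1, v)] = 1.0
            W["left"][i][idx(k - 1, v)] = 1.0
            W["toggle"][i][idx(k, 1 - v) if k == 0 else i] = 1.0
    return W


def forward(W, activity, action):
    """Gating: only the connections tied to the performed action are enabled."""
    enabled = W[action]
    return [1.0 if sum(w * a for w, a in zip(row, activity)) > 0.5 else 0.0
            for row in enabled]


W = gated_weights(3)
# Current room lit, the other two rooms dark.
activity = [0, 1, 1, 0, 1, 0]
print(forward(W, activity, "toggle"))
print(forward(W, activity, "right"))
```

In a trained network the weights would be real-valued and adjusted by learning; the binary matrices here only show what a correct solution could look like.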

Training  networks  of  the  form  shown  in  Figure  6  to  represent  update  graphs  relies 
on performing error back-propagation through time [30] while the decision maker
is interacting with the environment using a random policy. In simple environments,
the connectionist update graph outperforms the RS algorithm even though the action
sequence used to train the network is generated at random, whereas the RS algorithm
uses  a  specific  exploration  strategy.  We  conjecture  that  the  network  does  as  well  as  it 
does  because  it  considers  and  updates  many  hypotheses  in  parallel  at  each  time  step. 

Figure 6: Network architecture for learning update-graph representations
of finite-state environments.

The network is also able to construct models of stochastic environments. For example,
if the sensations in the 3-room world are slightly unreliable, the network still learns the
task. The RS algorithm cannot handle nondeterminism. In more complex environments,
however,  the  network  does  not  perform  as  well.  For  example,  it  failed  to  learn  a  32-room 
environment,  whereas  the  RS  algorithm  succeeded.  An  intelligent  exploration  strategy 
seems  necessary  in  this  case. 

In order to further develop these and other ideas, a test-bed was designed and
implemented for studying them as applied to spatial navigation problems. This test-bed is
described in the next section.

9  Spatial  Navigation  Test-Bed 

A  wide  variety  of  sequential  decision  tasks  can  be  formulated  in  terms  of  moving  in 
spatial  environments  while  receiving  sensory  information  providing  clues  as  to  location 
and  orientation.  Some  of  our  initial  explorations  of  reinforcement  learning  networks  were 
conducted  in  this  domain  [7,  5],  and  we  have  continued  to  find  this  a  good  domain  for 
posing  problems  and  investigating  solution  methods  (a  recent  example  is  described  in 
ref.  [9]).  Following  is  a  brief  description  of  a  test-bed  we  implemented  that  will  allow  us 
to address basic learning issues while at the same time providing us with fairly realistic
simulations  of  robot  navigation  tasks. 

The  system  simulates  a  cylindrical  robot  with  four  wheels  and  a  360°  sensor  belt. 
The  simulated  robot  can  translate  independently  and  simultaneously  in  both  the  x  and 
y  directions  relative  to  its  orientation.  The  motion  of  the  robot  is  simulated  as  discrete 
movements,  one  per  time  step.  The  robot  has  16  distance  sensors  and  16  grey-scale 
sensors  evenly  placed  around  its  perimeter.  The  distance  sensors  roughly  simulate  sonar. 
Information  from  these  simulated  sensors  is  processed  to  yield  16  distance  values  and  16 
grey-scale  sensor  values  which  measure  the  intensity  of  light  at  the  various  orientations. 
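
The geometry just described can be sketched as follows. This is an illustration under stated assumptions, not the simulator's code: it assumes the first sensor points along the robot's heading, that each time step applies the x and y translations in the robot's own frame and then the rotation, and that angles are in radians. The names `sensor_headings` and `step` are hypothetical.

```python
import math

NUM_SENSORS = 16  # 16 distance sensors and 16 grey-scale sensors share these directions


def sensor_headings(orientation):
    """World-frame directions of sensors evenly spaced around the perimeter."""
    spacing = 2 * math.pi / NUM_SENSORS  # 22.5 degrees between neighbours
    return [(orientation + i * spacing) % (2 * math.pi) for i in range(NUM_SENSORS)]


def step(pose, dx_body, dy_body, dtheta):
    """One discrete movement: translate in the robot's frame, then rotate."""
    x, y, theta = pose
    # Rotate the body-frame displacement into the world frame.
    x += dx_body * math.cos(theta) - dy_body * math.sin(theta)
    y += dx_body * math.sin(theta) + dy_body * math.cos(theta)
    theta = (theta + dtheta) % (2 * math.pi)
    return (x, y, theta)


pose = (0.0, 0.0, math.pi / 2)
print(step(pose, 1.0, 0.0, 0.0))
print(len(sensor_headings(pose[2])))
```

Because the translations are relative to orientation, the same (x, y) action moves the robot in different world directions depending on its heading.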

Figure  7  shows  a  display  created  by  the  navigation  simulator.  The  bottom  portion 
of  the  figure  shows  the  robot’s  environment  as  seen  from  above.  In  this  display,  the 
bold  circle  represents  the  robot’s  “home”  position,  with  the  radius  line  indicating  the 
home  orientation  for  a  homing  task.  The  other  circle  with  radius  line  represents  the 
robot’s  current  position  and  orientation.  The  topmost  canvas  shows  the  grey-scale  view 
from the home position and orientation, and the next canvas shows the grey-scale view
from  the  robot’s  current  position  and  orientation.  The  third  canvas  from  the  top  shows 
smoothed  distance  values  from  both  positions  and  orientations,  with  those  from  home 
shown  in  horizontal  stripes  and  those  from  current  shown  with  vertical  stripes.  The 
fourth  canvas  shows  smoothed  grey-scale  images  for  both  the  home  and  current  positions 
and orientations. The fifth panel from the top of the figure shows the actions of the
robot,  from  left  to  right,  x,  y,  and  rotation. 

This test-bed will be used by Bachrach for extending the approach to constructing
environmental models described above in Section 8. He will study homing tasks in which
several positions and orientations produce the same sensory stimulation.

Figure 7: Computer display generated by the navigation simulator.

10  Conclusion 

Progress  was  made  in  the  development  of  connectionist  learning  methods  permitting 
networks  to  learn  when  they  cannot  be  provided  with  training  information  of  the  high 
quality required by supervised-learning methods. These methods can permit the application
of adaptive connectionist networks to tasks involving complex dynamical behavior
and high degrees of uncertainty. The various projects undertaken with the support of
this grant were all motivated by issues related to the control by networks of dynamical
systems. It is apparent that connectionist techniques can substantively contribute to
the theory and practice of control of nonlinear dynamical systems with many degrees of
freedom. Future research will be directed toward studying these methods as applied to
a  variety  of  simulated  and  real  control  tasks. 

11.3 Participating Professionals


Following is a list of professionals who participated directly in research partially or
completely funded by AFOSR-87-0030, or closely related research.

Dr.  R.  S.  Sutton,  GTE  Laboratories  Incorporated,  Waltham,  MA.  Dr.  Sutton,  formerly 
a  student  of  Barto  whose  Ph.D.  research  was  supported  by  previous  AFOSR  grants,  has 
continued  to  interact  closely  with  Barto  and  students  funded  by  AFOSR-87-0030.  In  the 
period  being  reported  here,  Sutton  and  Barto  interacted  in  writing  a  conference  paper 
(ref. [34]) and two book chapters (refs. [33, 9]). Dr. Sutton has served as a member of
the Master’s committees of several students funded by AFOSR-87-0030.

Consultation  with  Dr.  P.  S.  Sastry,  Indian  Institute  of  Science,  Bangalore,  India,  on 
stochastic convergence theory of A_R-P and related algorithms. Period: 5/11/88-6/19/88.
Barto and Sastry are continuing to correspond regarding this topic.

Dr.  M.  I.  Jordan,  Department  of  Brain  and  Cognitive  Sciences,  Massachusetts  Institute 
of  Technology,  Cambridge,  MA.  Dr.  Jordan  was  supported  as  a  Post-Doctoral  Research 
Associate by AFOSR-87-0030 until he began his current position as Assistant Professor
at  MIT  in  January,  1988.  Since  that  time,  he  has  maintained  active  interaction  with 
researchers  funded  by  AFOSR-87-0030  and  is  serving  on  the  Ph.D.  committees  of  several 
of  the  graduate  students  supported  by  this  grant. 

Interaction with C. J. C. H. Watkins, Philips Research Laboratories, Cross Oak Lane,
Redhill, Surrey RH1 5HA, England. While employed at Philips, Watkins pursued a Ph.D.
in  Psychology  at  the  University  of  Cambridge,  Cambridge,  England.  His  dissertation, 
completed  in  June,  1989,  concerns  the  problems  of  learning  with  delayed  rewards,  a  topic 
partially  inspired  by  the  AFOSR  funded  research  of  our  group  on  reinforcement  learning. 
Watkins  elaborated  connections  between  our  approach  and  concepts  and  computational 
methods  from  the  theory  of  stochastic  dynamic  programming.  Watkins  collaborated  with 
Barto  and  Sutton  in  producing  the  technical  report  on  this  subject,  ref.  [9],  due  to  appear 
as  a  book  chapter. 

Interaction  with  Dr.  J.  C.  Houk,  Chairman,  Department  of  Physiology,  Northwestern 
University  Medical  Center,  Chicago,  Illinois.  Dr.  Houk  began  a  sabbatical  semester  as  a 
Visiting  Professor  in  the  Department  of  Computer  Science,  University  of  Massachusetts, 
in  Sept.  1988.  He  is  Principal  Investigator  of  ONR  Grant  N00014-88-K0339,  which  is 
supporting a computer science graduate student at the University of Massachusetts and
which  lists  Barto  as  a  consultant.  This  project  is  directed  toward  constructing  a  model 
of  the  cerebellum  as  a  trainable  pattern  generator,  and  is  closely  related  to  the  research 
supported  by  AFOSR  reported  here. 

11.4 Advanced Degrees

Following is a list of advanced degrees awarded to students who were partially or
completely supported by AFOSR-87-0030 while graduate students in the Department of
Computer Science, University of Massachusetts.

J.  S.  Judd  was  awarded  the  Ph.D.  Degree  in  Computer  and  Information  Science  in 
September,  1988,  for  research  supported  by  AFOSR-87-0030  and  a  previous  AFOSR 
grant. His dissertation is entitled “Neural Network Design and the Complexity of Learning.”
A revised version of the dissertation will be published in book form by the MIT
Press. Dr. Judd is currently an Adjunct Assistant Professor of Computer and Information
Science, University of Massachusetts, and is a Visiting Professor at the California
Institute  of  Technology,  Pasadena,  CA,  where  he  is  continuing  this  line  of  research. 

R.  A.  Jacobs  was  awarded  the  M.S.  Degree  in  May,  1987.  His  M.S.  project  was  entitled 
“Increased  Rates  of  Convergence  Through  Learning  Rate  Adaptation.”  A  paper  based 
on  this  project  appeared  in  the  journal  Neural  Networks.  Jacobs  is  currently  working  on 
a  Ph.D.  dissertation  on  modular  network  architectures  which  is  expected  to  be  complete 
in  the  Spring  of  1990. 

V.  Gullapalli  was  awarded  the  M.S.  Degree  in  May,  1988.  His  M.S.  project  was  entitled 
“Stochastic Reinforcement Learning in Motor Control.” A paper based on this project is
currently in review for the journal Neural Networks. Gullapalli is currently working on a
Ph.D.  dissertation  on  applying  reinforcement  learning  to  motor  control  problems  which 
is  expected  to  be  complete  in  the  Fall  of  1990  or  the  Spring  of  1991. 

J.  R.  Bachrach  was  awarded  the  M.S.  Degree  in  December,  1988.  His  M.S.  project  was 
entitled “Learning to Represent State.” Bachrach is currently working on a Ph.D. dissertation
on this same subject which is expected to be complete in the Spring or Summer
of 1990.

V.  Bauer  completed  an  M.S.  project  with  partial  support  from  the  grant  in  August,  1989. 
Her  project  was  entitled  “Effect  of  Discounting  on  Rate  of  Convergence  in  Temporal 
Difference  Learning.”  Bauer  will  receive  the  M.S.  degree  in  December,  1989,  and  will  not 
pursue a Ph.D. degree at the present time.